Abstract. The search for similarity and dissimilarity measures on phylogenetic trees has
been motivated by the computation of consensus trees, the search by similarity in
phylogenetic databases, and the assessment of clustering results in bioinformatics. The
transposition distance for fully resolved phylogenetic trees is a recent addition to the
extensive collection of available metrics for comparing phylogenetic trees. In this paper,
we generalize the transposition distance from fully resolved to arbitrary phylogenetic
trees, through a construction that involves an embedding of the set of phylogenetic trees
with a fixed number of labeled leaves into a symmetric group and a generalization of
Reidys-Stadler's involution metric for RNA contact structures. We also present simple
linear-time algorithms for computing it.

1 Introduction

The need for comparing phylogenetic trees arises when alternative phylogenies are
obtained using different phylogenetic methods or different gene sequences for a given set
of species. The comparison of phylogenetic trees is also essential to performing
phylogenetic queries on databases of phylogenetic trees [?]. Further, the need for comparing
phylogenetic trees also arises in the comparative analysis of clustering results obtained
using different clustering methods or even different distance matrices, and there is a
growing interest in the assessment of clustering results in bioinformatics [6].

A number of metrics for phylogenetic tree comparison are known, including the partition
(or symmetric difference) metric [9,12], the nearest-neighbor interchange metric [19],
the subtree transfer distance [1], the metric from the crossover method [?], the quartet
metric [3], and the metric from the nodal distance algorithm [?]. One of the simplest and
easiest to compute metrics proposed so far, the transposition distance [?], is only defined
for fully resolved trees. But phylogenetic analyses often produce phylogenies with
polytomies, that is, phylogenetic trees that are not fully resolved. As a matter of fact, at
the time of this writing, more than 66.5% of the phylogenies contained in TreeBASE have
polytomies.

In this paper, we generalize this transposition distance to arbitrary phylogenetic trees,
through a new definition of it. This new distance is directly inspired, on the one hand, by
the matching representation of phylogenetic trees [4,16] and, on the other hand, by the
involution metric for RNA contact structures [11,14].



The matching representation $M(T)$ of a phylogenetic tree $T = (V, E)$ with $n$ leaves
labeled $1, \ldots, n$ describes $T$ injectively as a partition of $\{1, \ldots, |V| - 1\}$.
If $T$ is fully resolved, which is the particular case considered in [?], then all members
of this partition are 2-element sets, and then, since $|V| = 2n - 1$, it defines an
undirected 1-regular graph $(\{1, \ldots, 2n-2\}, M(T))$. Reidys and Stadler defined the
involution metric on 1-regular graphs by associating to each such graph the permutation
given by the product of the transpositions corresponding to its edges, and then using the
canonical metric on the symmetric group $\mathfrak{S}_{2n-2}$ (the least number of
transpositions necessary to transform one permutation into another) to compare these
permutations. The translation of this metric to matching representations yields twice the
matching distance defined in [?]. Unfortunately, no meaningful generalization of Reidys and
Stadler's metric to arbitrary graphs is known, the main drawback being the difficulty of
associating injectively a well-defined permutation to an arbitrary graph.

Now, if $T$ is not fully resolved, the members of $M(T)$ are no longer pairs of numbers,
and therefore they do not define a graph, at least not directly. Actually, the approach that
we take in this paper can be understood as if we represented each member
$\{i_1, \ldots, i_k\}$ of $M(T)$, with $i_1 < \cdots < i_k$, as a cyclic directed graph with
arcs $(i_1,i_2), \ldots, (i_{k-1},i_k), (i_k,i_1)$, and $M(T)$ as the sum of these cyclic
graphs. Now, generalizing Reidys-Stadler's approach, we associate to every such cyclic
directed graph the cyclic permutation $(i_1, \ldots, i_k)$ (if $k = 2$, it is a
transposition), and we describe $M(T)$ by means of the product of the cyclic permutations
associated to its members: since these members are pairwise disjoint, this product is
well-defined. This defines an embedding of the set of phylogenetic trees with $n$ leaves
labeled $1, \ldots, n$ into the symmetric group $\mathfrak{S}_{2n-2}$. The transposition
distance is obtained by translating the canonical metric on $\mathfrak{S}_{2n-2}$ into a
distance for phylogenetic trees through this embedding. This transposition distance measures
the least number of certain simple operations (splitting sets of children, joining sets of
children, interchanging children) that are necessary to transform one tree into another, and
it can be easily computed in linear time. Therefore it satisfies the requirements of
"computational simplicity" and "good theoretical basis" that are required of any distance
notion on phylogenetic trees [?].

2 Matching Representation of Phylogenetic Trees 

Throughout this paper, by a phylogenetic tree we mean a rooted tree with injectively labeled
leaves and without outdegree 1 nodes. Thus, a phylogenetic tree is a directed finite graph
$T = (V, E)$ containing a distinguished node $r \in V$, called the root, such that for every
other node $v \in V$ there exists one, and only one, path from the root $r$ to $v$. The
children of a node $v$ in a tree $T = (V, E)$ are those nodes $w \in V$ such that
$(v, w) \in E$. The outdegree of a node is the number of its children. The nodes without
children are the leaves of the tree, and the remaining nodes are called internal: since we
assume that no node has outdegree 1, every internal node has at least 2 children. The set of
leaves of $T$ is denoted by $\mathcal{L}(T)$.
The height of a node $v$ in a tree $T$ is the length of a longest directed path from $v$ to
a leaf. Thus, the nodes with height 0 are the leaves, the nodes with height 1 are the nodes
all of whose children are leaves, and so on.



The leaves of a phylogenetic tree are injectively labeled in a fixed, but arbitrary,
ordered set: these labels are called taxa. In practice, if the tree has $n$ leaves, we shall
identify their labels with $1, \ldots, n$, ordered in the usual increasing way. The label
associated to a leaf $v \in V$ will be denoted by $\ell(v)$.

We shall denote by $\mathcal{T}_n$ the set of all phylogenetic trees with $n$ leaves labeled
$1, \ldots, n$ (up to label-preserving isomorphisms of rooted trees).

Definition 1. The bottom-up ordering (cf. [?]) of a phylogenetic tree
$T = (V, E) \in \mathcal{T}_n$ is the injective mapping

$$\ell : V \to \{1, \ldots, |V|\}$$

defined by the following properties:

(a) If $v \in \mathcal{L}(T)$, then $\ell(v)$ is its label.

(b) If $\mathrm{height}(u) < \mathrm{height}(v)$, then $\ell(u) < \ell(v)$.

(c) If $0 < \mathrm{height}(u) = \mathrm{height}(v)$ and

$$\min\{\ell(x) \mid x \in \mathrm{children}(u)\} < \min\{\ell(x) \mid x \in \mathrm{children}(v)\},$$

then $\ell(u) < \ell(v)$.

It is straightforward to notice that this bottom-up ordering is unique, and it can be
computed in time linear in the size of the tree by bottom-up tree traversal techniques [18].
First, the leaves of $T$ are labeled by their label in $\{1, \ldots, n\}$. Then, the height 1
nodes are labeled from $n + 1$ on in the order given by the smallest label of their
children: i.e., the height 1 node with the smallest child label is assigned label $n + 1$,
the height 1 node with the next-smallest child label is assigned label $n + 2$, etc. This
procedure is then continued for consecutively increasing heights. The detailed pseudocode is
given in Algorithm 1.
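The labeling procedure just described can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the tree encoding (leaves as integers, internal nodes as tuples of subtrees) and the function name `bottom_up_ordering` are hypothetical, and the sketch sorts the nodes of each height for brevity instead of using the linear-time traversal mentioned above. The sample tree is the one whose matching representation appears in Example 2.

```python
def bottom_up_ordering(tree):
    """Return a dict mapping every node of `tree` to its bottom-up label.

    Assumed encoding: a leaf is an int in 1..n, an internal node is a tuple
    of its subtrees (tuples are unique keys because leaf labels are distinct).
    """
    height = {}

    def visit(node):
        # Height 0 for leaves, 1 + max child height for internal nodes.
        if isinstance(node, tuple):
            height[node] = 1 + max(visit(child) for child in node)
        else:
            height[node] = 0
        return height[node]

    visit(tree)

    # Property (a): a leaf keeps its own label.
    label = {v: v for v in height if not isinstance(v, tuple)}
    next_label = len(label)

    by_height = {}
    for v, h in height.items():
        if h > 0:
            by_height.setdefault(h, []).append(v)

    # Properties (b) and (c): increasing height, ties broken by the
    # smallest label among the children (already assigned at this point).
    for h in sorted(by_height):
        for v in sorted(by_height[h], key=lambda u: min(label[c] for c in u)):
            next_label += 1
            label[v] = next_label
    return label


# Tree with internal nodes {1,5,7,9}, {4,6,10}, {2,11}, {8,13}, {3,12,14}.
fig1 = (3, (4, 6, 10), (8, (2, (1, 5, 7, 9))))
labels = bottom_up_ordering(fig1)
print(labels[fig1])  # label of the root
```

The recursion depth equals the height of the tree, so very deep trees would need an iterative traversal.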

Example 1. Fig. 1 shows the tree T166c11x6x95c08c56c38 in TreeBASE and its bottom-up
ordering after sorting its taxa alphabetically.

The next definition generalizes the perfect matching representation of binary, or fully
resolved, trees [4,16].

Definition 2. Let $T = (V, E)$ be a phylogenetic tree with $n$ leaves labeled
$1, \ldots, n$, and let $\ell : V \to \{1, \ldots, |V|\}$ be its bottom-up ordering. The
matching representation $M(T)$ of $T$ is the partition of $\{1, \ldots, |V| - 1\}$ defined
as follows:

$$M(T) = \{\ell(\mathrm{children}(u)) \mid u \in V - \mathcal{L}(T)\}.$$

Example 2. The matching representation of the tree in Fig. 1 is the partition of
$\{1, \ldots, 14\}$ given by

$$\{\{1, 5, 7, 9\}, \{4, 6, 10\}, \{2, 11\}, \{8, 13\}, \{3, 12, 14\}\}.$$



begin
    foreach node $v$ of $T$ do
        if $v$ is a leaf node of $T$ then
            set $\ell(v)$ to the index of the label of $v$ in $L$
        else
            $\ell(v) := 0$
    $i := |L|$
    foreach level $h$ of $T$ from the leaves up to the root do
        let $S$ be the set of nodes of $T$ at level $h$, ordered by label
        foreach $v \in S$ do
            let $w$ be the parent of $v$ in $T$
            if $\ell(w) = 0$ and $\mathrm{height}(w) = h + 1$ then
                $i := i + 1$
                $\ell(w) := i$
    return $\ell$
end



Algorithm 1: Bottom-up ordering. Given an ordered set $L$ and a phylogenetic
tree $T$ with leaves bijectively labeled in $L$, the algorithm computes the bottom-up
ordering of $T$.

It is clear that, once the bottom-up ordering of $T$ has been obtained, the set $M(T)$ can
be produced in time linear in the size of the tree. Furthermore, the following two results
are straightforward.

Corollary 1. For every $T = (V,E) \in \mathcal{T}_n$, $|M(T)| = |V| - n$.

Corollary 2. For every $T_1, T_2 \in \mathcal{T}_n$, if $M(T_1) = M(T_2)$, then $T_1 = T_2$.
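Both corollaries can be checked on the running example with a short Python sketch. It assumes, hypothetically, that a bottom-up ordered tree is stored as a dictionary from each internal node label to the list of its children's labels; the names `matching_representation` and `tree_from_matching` are illustrative only. The second function rebuilds the tree from $M(T)$ level by level, which is one way to see why $M(T)$ determines $T$.

```python
def matching_representation(children):
    """M(T): one set of child labels per internal node."""
    return {frozenset(kids) for kids in children.values()}


def tree_from_matching(blocks):
    """Recover the parent -> children map of a bottom-up ordered tree from M(T)."""
    total = max(max(b) for b in blocks) + 1   # |V|, since M(T) partitions {1..|V|-1}
    next_label = total - len(blocks)          # n = |V| - |M(T)| (Corollary 1)
    pending, children = set(blocks), {}
    while pending:
        # Blocks whose members are all labeled already form the next height level;
        # within a level, parents are numbered by increasing smallest child label.
        ready = sorted((b for b in pending if max(b) <= next_label), key=min)
        if not ready:
            raise ValueError("not the matching representation of a phylogenetic tree")
        for block in ready:
            next_label += 1
            children[next_label] = sorted(block)
        pending -= set(ready)
    return children


# Bottom-up ordered tree of Example 2 (10 leaves, 15 nodes).
fig1 = {11: [1, 5, 7, 9], 12: [4, 6, 10], 13: [2, 11], 14: [8, 13], 15: [3, 12, 14]}
m = matching_representation(fig1)
print(len(m))                         # |V| - n = 15 - 10
print(tree_from_matching(m) == fig1)  # the tree is recovered from M(T)
```

The reconstruction loop rescans the pending blocks at every level, so it is quadratic in the worst case; it is meant as a check, not as an efficient routine.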

3 The transposition distance 

For every $m \geq 1$, let $\mathfrak{S}_m$ denote the symmetric group of permutations of
$\{1, \ldots, m\}$. By a cycle in $\mathfrak{S}_m$ we understand a cyclic permutation
$(i_1, i_2, \ldots, i_k) \in \mathfrak{S}_m$, with $k \geq 2$, that sends $i_1$ to $i_2$,
$i_2$ to $i_3$, ..., $i_{k-1}$ to $i_k$, and $i_k$ to $i_1$, leaving fixed the remaining
elements of $\{1, \ldots, m\}$. Recall that the inverse of a cycle $(i_1, i_2, \ldots, i_k)$
is $(i_1, i_2, \ldots, i_k)^{-1} = (i_k, i_{k-1}, \ldots, i_1)$: the permutation that sends
$i_k$ to $i_{k-1}$, $i_{k-1}$ to $i_{k-2}$, ..., $i_2$ to $i_1$, and $i_1$ to $i_k$. The
length of a cycle $(i_1, i_2, \ldots, i_k)$ is the number $k$ of elements it moves.

The cycle associated to a subset $S = \{i_1, \ldots, i_k\}$, with $i_1 < \cdots < i_k$ and
$k \geq 2$, of $\{1, \ldots, m\}$, is $\kappa(S) := (i_1, i_2, \ldots, i_k) \in \mathfrak{S}_m$.
If $k = 1$, i.e., if $S$ is a singleton, then $\kappa(S)$ is the identity in
$\mathfrak{S}_m$, which we do not consider a cycle.

Definition 3. The matching permutation $\pi(T)$ associated to a phylogenetic tree
$T = (V, E) \in \mathcal{T}_n$ is the permutation of $\{1, \ldots, |V| - 1\}$ defined by the
product of the sorted cycles associated to the members of its matching representation:

$$\pi(T) = \prod_{u \in V - \mathcal{L}(T)} \kappa(\ell(\mathrm{children}(u))).$$
Fig. 1. A phylogenetic tree (left) and its bottom-up ordering (right).



Example 3. The matching permutation associated to the tree in Fig. 1 is the product of
cycles

$$(1, 5, 7, 9)(4, 6, 10)(2, 11)(8, 13)(3, 12, 14) \in \mathfrak{S}_{14},$$

i.e., the permutation

$$\begin{pmatrix}
1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 & 13 & 14 \\
5 & 11 & 12 & 6 & 7 & 10 & 9 & 13 & 1 & 4 & 2 & 14 & 8 & 3
\end{pmatrix}$$



If $u, v \in V - \mathcal{L}(T)$ are two different internal nodes of $T$, then
$\ell(\mathrm{children}(u)) \cap \ell(\mathrm{children}(v)) = \emptyset$. Therefore, all
cycles $\kappa(\ell(\mathrm{children}(u)))$ appearing in the product defining $\pi(T)$ are
pairwise disjoint, and hence they commute with each other, which implies that this product
is well defined.

Notice that no element in $\{1, \ldots, |V| - 1\}$ remains fixed by $\pi(T)$, because every
$\ell(\mathrm{children}(u))$, with $u$ internal, has at least two elements and every element
in $\{1, \ldots, |V| - 1\}$ is the bottom-up ordering label of a child of some internal
node. Now, if $T = (V, E)$ is a phylogenetic tree with $n$ leaves, then $|V| \leq 2n - 1$,
the equality holding if and only if $T$ is binary. To be able to compare matching
permutations of phylogenetic trees with the same number of leaves $n$ but different numbers
of internal nodes, we shall understand henceforth that the matching permutation $\pi(T)$
belongs to $\mathfrak{S}_{2n-2}$, leaving fixed the elements $|V|, \ldots, 2n-2$.
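As a small illustration of Definition 3 and of the embedding into $\mathfrak{S}_{2n-2}$, the following Python sketch builds $\pi(T)$ from a matching representation. The function name `matching_permutation` and the list-based encoding (position 0 unused, position $i$ holding the image of $i$) are assumptions made for this example.

```python
def matching_permutation(blocks, n):
    """Product of the sorted cycles of the blocks, as an element of S_{2n-2}.

    Returns a list `perm` with perm[0] unused and perm[i] the image of i;
    elements |V|, ..., 2n-2 stay fixed.
    """
    m = 2 * n - 2
    perm = list(range(m + 1))           # start from the identity
    for block in blocks:
        s = sorted(block)               # sorted cycle (i_1, ..., i_k)
        for a, b in zip(s, s[1:] + s[:1]):
            perm[a] = b                 # i_j -> i_{j+1}, and i_k -> i_1
    return perm


# Matching representation of Example 2 (n = 10 leaves).
blocks = [{1, 5, 7, 9}, {4, 6, 10}, {2, 11}, {8, 13}, {3, 12, 14}]
perm = matching_permutation(blocks, 10)
print(perm[1:15])   # second row of the two-line notation in Example 3
print(perm[15:])    # fixed elements 15, ..., 18
```

Because the blocks are disjoint, the order in which they are processed does not matter, which mirrors the commutativity argument above.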

The following result is a direct consequence of the facts that the matching representation
of a phylogenetic tree uniquely determines it and that every permutation has a unique
decomposition as a product of disjoint cycles of length $\geq 2$.



Proposition 1. For every $T_1, T_2 \in \mathcal{T}_n$, if $\pi(T_1) = \pi(T_2)$, then
$T_1 = T_2$.

Remark 1. If we allow the existence of outdegree 1 nodes in our phylogenetic trees, then
the last proposition is no longer true. Indeed, consider the trees in Fig. 2. The left-hand
side one has matching representation $\{\{1, 2, 3\}, \{4\}\}$, while the right-hand side one
has matching representation $\{\{1, 2, 3\}\}$. Therefore the matching permutation associated
to both trees is $(1, 2, 3)$ (considered as an element of $\mathfrak{S}_4$).



Fig. 2. Two trees with the same matching permutation. 

Arguing as in [11, Cor. 1], we have the following result.

Theorem 1. The mapping that associates to every pair $(T_1, T_2)$ of phylogenetic trees with
$n$ leaves labeled in $\{1, \ldots, n\}$ the least number $TD'(T_1, T_2)$ of transpositions
necessary to represent the permutation $\pi(T_2)^{-1}\pi(T_1) \in \mathfrak{S}_{2n-2}$ is a
metric on $\mathcal{T}_n$.

Proof. By Proposition 1, the mapping $\pi : \mathcal{T}_n \to \mathfrak{S}_{2n-2}$ that sends
every $T \in \mathcal{T}_n$ to its matching permutation $\pi(T)$ is an embedding. Then, since
the mapping

$$d_{\mathrm{trans}} : \mathfrak{S}_{2n-2} \times \mathfrak{S}_{2n-2} \to \mathbb{N}$$

defined by

$$d_{\mathrm{trans}}(\pi_1, \pi_2) = \text{the least number of transpositions necessary to represent } \pi_2^{-1} \cdot \pi_1$$

is a metric on $\mathfrak{S}_{2n-2}$ (see, for instance, [?, Thm. 2]), the mapping

$$TD' : \mathcal{T}_n \times \mathcal{T}_n \to \mathbb{N}, \qquad (T_1, T_2) \mapsto d_{\mathrm{trans}}(\pi(T_1), \pi(T_2))$$

is a metric on $\mathcal{T}_n$. □

Remark 2. Recall that the least number of transpositions required to represent a cycle of
length $k$ is $k - 1$, for instance through

$$(i_1, \ldots, i_k) = (i_1, i_2)(i_2, i_3) \cdots (i_{k-1}, i_k),$$

and that the least number of transpositions required to represent a product of disjoint
cycles is the sum of the least numbers of transpositions each cycle decomposes into, and
hence the sum of the cycles' lengths minus the number of cycles.
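The counting rule of Remark 2 translates directly into a single pass over the cycles of a permutation. The Python sketch below assumes the same hypothetical list encoding as before (position 0 unused, `perm[i]` the image of `i`); the name `min_transpositions` is illustrative.

```python
def min_transpositions(perm):
    """Sum over the cycles of `perm` of (length - 1); fixed points add nothing."""
    seen = [False] * len(perm)
    total = 0
    for start in range(1, len(perm)):
        length, i = 0, start
        while not seen[i]:          # walk the cycle through `start` once
            seen[i] = True
            i = perm[i]
            length += 1
        if length > 1:
            total += length - 1
    return total


# A single 4-cycle (1,2,3,4) needs 3 transpositions.
print(min_transpositions([0, 2, 3, 4, 1]))
# The matching permutation of Example 3: 14 moved elements in 5 cycles.
example3 = [0, 5, 11, 12, 6, 7, 10, 9, 13, 1, 4, 2, 14, 8, 3]
print(min_transpositions(example3))
```

Every element is visited once, so the count takes time linear in the size of the permutation.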



The metric TD' satisfies the following property. 

Proposition 2. For every $T_1, T_2 \in \mathcal{T}_n$, $TD'(T_1, T_2)$ is an even integer
smaller than $2n - 2$.

Proof. If $T_1, T_2 \in \mathcal{T}_n$ have $m_1$ and $m_2$ internal nodes, respectively,
then each $\pi(T_i)$ ($i = 1, 2$) decomposes into $m_i$ disjoint cycles: say
$\pi(T_i) = C_{i,1} \cdots C_{i,m_i}$, with $C_{i,j}$ of length $k_{i,j}$. Then, by
Remark 2, $\pi(T_i)$ has a decomposition into

$$\sum_{j=1}^{m_i} (k_{i,j} - 1) = \sum_{j=1}^{m_i} k_{i,j} - m_i = (n + m_i - 1) - m_i = n - 1$$

transpositions. But then $\pi(T_2)^{-1}\pi(T_1)$ admits a decomposition into $2(n - 1)$
transpositions. This entails that every decomposition of this permutation into a product of
transpositions must involve an even number of them, and therefore that $TD'(T_1, T_2)$ is an
even integer.
As far as the stated upper bound for $TD'(T_1, T_2)$ goes, notice that
$\pi(T_2)^{-1}\pi(T_1)$ moves at most $2n - 2$ elements and that, if it is not the identity,
then its decomposition into disjoint cycles has at least 1 cycle. Therefore, again by
Remark 2, a minimal decomposition of this permutation into transpositions will involve at
most $(2n - 2) - 1$ transpositions, and since $TD'(T_1, T_2)$ is even, this implies that
$TD'(T_1, T_2) \leq 2n - 4$.

In other words, $TD'$ is "artificially" multiplied by 2. Thus, we define a new metric on
$\mathcal{T}_n$ by dividing $TD'$ by 2.

Definition 4. The transposition distance on $\mathcal{T}_n$ is

$$TD : \mathcal{T}_n \times \mathcal{T}_n \to \mathbb{N}, \qquad (T_1, T_2) \mapsto \tfrac{1}{2}\, TD'(T_1, T_2).$$

In this way, $TD$ takes values in $\{0, 1, 2, \ldots, n - 2\}$.
Fig. 3. From left to right, the phylogenetic trees $T_1$, $T_2$, $T_3$, and $T_4$ in
Example 4.



Table 1. Transposition distances between pairs of trees $T_1, \ldots, T_4$.
Example 4. Let $T_1, T_2, T_3, T_4$ be the phylogenetic trees displayed in Fig. 3 (which we
already give bottom-up ordered). Their matching permutations are

$$\pi(T_1) = (1,2)(3,4)(5,6), \quad \pi(T_2) = (1,3)(2,4)(5,6),$$
$$\pi(T_3) = (1,2,3)(4,5), \quad \pi(T_4) = (1,2)(3,5)(4,6)$$

(understood as permutations in $\mathfrak{S}_6$), and then

$$\pi(T_2)^{-1}\pi(T_1) = (3,1)(4,2)(6,5)(1,2)(3,4)(5,6) = (1,4)(2,3)$$
$$\pi(T_3)^{-1}\pi(T_1) = (3,2,1)(5,4)(1,2)(3,4)(5,6) = (2,3,5,6,4)$$
$$\pi(T_4)^{-1}\pi(T_1) = (2,1)(5,3)(6,4)(1,2)(3,4)(5,6) = (3,6)(4,5)$$
$$\pi(T_3)^{-1}\pi(T_2) = (3,2,1)(5,4)(1,3)(2,4)(5,6) = (1,2,5,6,4)$$
$$\pi(T_4)^{-1}\pi(T_2) = (2,1)(5,3)(6,4)(1,3)(2,4)(5,6) = (1,5,4)(3,2,6)$$
$$\pi(T_4)^{-1}\pi(T_3) = (2,1)(5,3)(6,4)(1,2,3)(4,5) = (2,5,6,4,3)$$

which yields the distances between these trees given in Table 1.
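The six products in Example 4 can be re-derived mechanically. The Python sketch below is a direct, non-optimized illustration of Theorem 1 and Definition 4: it composes $\pi(T_2)^{-1}$ with $\pi(T_1)$ and counts moved elements minus cycles. The helper names (`perm_from_cycles`, `d_trans`) and the list encoding are assumptions of this example, not part of the paper's software.

```python
from itertools import combinations


def perm_from_cycles(cycles, m):
    """Permutation of {1..m} as a list (index 0 unused) from disjoint cycles."""
    perm = list(range(m + 1))
    for c in cycles:
        for a, b in zip(c, c[1:] + c[:1]):
            perm[a] = b
    return perm


def d_trans(p1, p2):
    """Least number of transpositions needed to represent p2^{-1} p1."""
    m = len(p1) - 1
    inv2 = list(range(m + 1))
    for i in range(1, m + 1):
        inv2[p2[i]] = i
    sigma = [inv2[p1[i]] for i in range(m + 1)]    # apply p1 first, then p2^{-1}
    seen = [False] * (m + 1)
    moved = cycles = 0
    for start in range(1, m + 1):
        if not seen[start] and sigma[start] != start:
            cycles += 1
            i = start
            while not seen[i]:
                seen[i] = True
                moved += 1
                i = sigma[i]
    return moved - cycles                           # Remark 2


# Matching permutations of T1..T4 from Example 4, all in S_6.
perms = {
    1: perm_from_cycles([(1, 2), (3, 4), (5, 6)], 6),
    2: perm_from_cycles([(1, 3), (2, 4), (5, 6)], 6),
    3: perm_from_cycles([(1, 2, 3), (4, 5)], 6),
    4: perm_from_cycles([(1, 2), (3, 5), (4, 6)], 6),
}
table = {(a, b): d_trans(perms[a], perms[b]) // 2
         for a, b in combinations(sorted(perms), 2)}
for pair, td in table.items():
    print(pair, td)
```

Reading the cycle types off the products above gives the same values: two transpositions for the pairs $(T_1,T_2)$ and $(T_1,T_4)$, and four for the remaining pairs, before halving.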

The transposition distance between two phylogenetic trees can be easily computed in
linear time. To prove it, we move to the more general setting of permutations and the
graphs associated to them.

For every permutation $\pi \in \mathfrak{S}_m$, the directed graph associated to $\pi$ is
the graph $G_\pi = (\{1, \ldots, m\}, Q_\pi)$ with

$$Q_\pi = \{(i, j) \mid i \neq j \text{ and } \pi(i) = j\}.$$

The directed graph $G_{\pi^{-1}}$ associated to the inverse $\pi^{-1}$ of a permutation
$\pi$ is obtained by reversing all arrows in $G_\pi$: thus, $Q_{\pi^{-1}} = Q_\pi^{-1}$ and
$G_{\pi^{-1}} = G_\pi^{-1}$.

Given two permutations $\pi_1, \pi_2 \in \mathfrak{S}_m$, by $G_{\pi_1} + G_{\pi_2}^{-1}$ we
understand the 2-colored-arcs multigraph with set of nodes $\{1, \ldots, m\}$, set of red
arcs $Q_{\pi_1}$ and set of blue arcs $Q_{\pi_2}^{-1}$. We shall say that a node of
$G_{\pi_1} + G_{\pi_2}^{-1}$ is unbalanced when it is isolated in one, and only one, of the
graphs $G_{\pi_1}$, $G_{\pi_2}^{-1}$ (which means that it is fixed by one, and only one, of
the permutations $\pi_1, \pi_2$).
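To make these definitions concrete, here is a minimal Python sketch of the arc sets and of the unbalanced nodes. Permutations are stored, as an assumption of this example, as dictionaries that list only the moved elements; the function names are illustrative. The data are the two matching permutations used later in Example 5.

```python
def perm_from_cycles(cycles):
    """Dictionary i -> pi(i) listing only the elements moved by the cycles."""
    perm = {}
    for c in cycles:
        for a, b in zip(c, c[1:] + c[:1]):
            perm[a] = b
    return perm


def arcs(perm):
    """Q_pi = {(i, j) : i != j and pi(i) = j}."""
    return {(i, j) for i, j in perm.items() if i != j}


def unbalanced_nodes(pi1, pi2):
    """Nodes isolated in exactly one of G_{pi1} and G_{pi2}^{-1}."""
    red = arcs(pi1)
    blue = {(j, i) for (i, j) in arcs(pi2)}     # reversing arrows gives Q_{pi2}^{-1}
    red_nodes = {i for i, _ in red}             # every moved node is the tail of one arc
    blue_nodes = {i for i, _ in blue}
    return red_nodes ^ blue_nodes               # symmetric difference


pi1 = perm_from_cycles([(1, 5, 7, 9), (4, 6, 10), (2, 11), (8, 13), (3, 12, 14)])
pi2 = perm_from_cycles([(4, 6), (7, 5), (1, 12), (10, 11), (9, 14),
                        (13, 15), (2, 16), (8, 17), (3, 18)])
print(len(arcs(pi1)), len(arcs(pi2)))
print(sorted(unbalanced_nodes(pi1, pi2)))
```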



Proposition 3. For every unbalanced node $i$ of $G_{\pi_1} + G_{\pi_2}^{-1}$:

(1) If $i$ is isolated in $G_{\pi_2}^{-1}$ and $(i_0, i), (i, i_1) \in Q_{\pi_1}$ with
$i_0 \neq i_1$, then replacing the red arcs $(i_0, i)$ and $(i, i_1)$ by a single red arc
$(i_0, i_1)$ decreases $d_{\mathrm{trans}}(\pi_1, \pi_2)$ by 1.

(2) If $i$ is isolated in $G_{\pi_2}^{-1}$ and $(i, i_1), (i_1, i) \in Q_{\pi_1}$, then
removing the red arcs $(i, i_1)$ and $(i_1, i)$ decreases $d_{\mathrm{trans}}(\pi_1, \pi_2)$
by 1.

(3) Similar properties hold if $i$ is isolated in $G_{\pi_1}$ but not in $G_{\pi_2}^{-1}$
and we modify the set of blue arcs.

Proof. (1) If $(i_0, i), (i, i_1) \in Q_{\pi_1}$, with $i_0 \neq i_1$, then
$i_0 = \pi_1^{-1}(i)$ and $i_1 = \pi_1(i)$, and hence $(i, i_1)\pi_1(i_0) = i_1$,
$(i, i_1)\pi_1(i) = i$, and $(i, i_1)\pi_1(j) = \pi_1(j)$ for every $j \neq i_0, i$.
Therefore, replacing the arcs $(i_0, i), (i, i_1)$ by an arc $(i_0, i_1)$ is equivalent to
replacing $\pi_1$ by $(i, i_1)\pi_1$. So, it is enough to prove that, with the notations and
assumptions of point (1),

$$d_{\mathrm{trans}}(\pi_1, \pi_2) = d_{\mathrm{trans}}((i, i_1)\pi_1, \pi_2) + 1.$$

To prove this equality, notice that, since $i$ is fixed by $\pi_2$, $\pi_2^{-1}\pi_1$ sends
$i_0$ to $i$ and $i$ to $\pi_2^{-1}(i_1)$: let us denote this last index by $j_1$.

If $j_1 = i_0$, then $(i_0, i)$ is a cycle of $\pi_2^{-1}\pi_1$ and it contributes one
transposition to any minimal decomposition of this permutation as a product of
transpositions. But then both $i$ and $i_0$ are fixed by $\pi_2^{-1}((i, i_1)\pi_1)$, and
since $\pi_2^{-1}\pi_1$ and $\pi_2^{-1}((i, i_1)\pi_1)$ act exactly in the same way on the
other elements, we deduce that

$$\pi_2^{-1}\pi_1 = (i_0, i)\bigl(\pi_2^{-1}((i, i_1)\pi_1)\bigr)$$

and then $d_{\mathrm{trans}}(\pi_1, \pi_2) = d_{\mathrm{trans}}((i, i_1)\pi_1, \pi_2) + 1$ in
this case.

If $j_1 \neq i_0$, then the cycle of $\pi_2^{-1}\pi_1$ moving $i_0$ has at least three
elements:

$$(i_0, i, j_1, j_2, \ldots, j_s), \text{ with } s \geq 1,$$

and thus it contributes $s + 1$ transpositions to a minimal decomposition of
$\pi_2^{-1}\pi_1$ as a product of transpositions. Now, the cycle of
$\pi_2^{-1}((i, i_1)\pi_1)$ that moves $i_0$ is

$$(i_0, j_1, j_2, \ldots, j_s),$$

and it only contributes $s$ transpositions to a minimal decomposition of
$\pi_2^{-1}((i, i_1)\pi_1)$ as a product of transpositions. Therefore,
$d_{\mathrm{trans}}(\pi_1, \pi_2) = d_{\mathrm{trans}}((i, i_1)\pi_1, \pi_2) + 1$ also in
this case.

(2) If $(i, i_1), (i_1, i) \in Q_{\pi_1}$, then $\pi_1^{-1}(i) = \pi_1(i) = i_1$, and hence

$$(i, i_1)\pi_1(i) = i, \quad (i, i_1)\pi_1(i_1) = i_1,$$

and $(i, i_1)\pi_1(j) = \pi_1(j)$ for every $j \neq i, i_1$. Therefore, to remove the arcs
$(i_1, i), (i, i_1)$ in this case means again to replace $\pi_1$ by $(i, i_1)\pi_1$. So,
again in this case, it is enough to prove that, with the notations and assumptions of
point (2),

$$d_{\mathrm{trans}}(\pi_1, \pi_2) = d_{\mathrm{trans}}((i, i_1)\pi_1, \pi_2) + 1.$$

Since $i$ is fixed by $\pi_2$, we have that $\pi_2^{-1}\pi_1$ sends $i_1$ to $i$ and $i$ to
$\pi_2^{-1}(i_1)$: let us denote this last index by $j_1$.

If $j_1 = i_1$, i.e., if $i_1$ is also fixed by $\pi_2$, then $(i, i_1)$ is a cycle of
$\pi_2^{-1}\pi_1$ and it contributes one transposition to any minimal decomposition of this
permutation as a product of transpositions. But then both $i$ and $i_1$ are fixed by
$\pi_2^{-1}((i, i_1)\pi_1)$ and

$$\pi_2^{-1}\pi_1 = (i, i_1)\bigl(\pi_2^{-1}((i, i_1)\pi_1)\bigr),$$

and then $d_{\mathrm{trans}}(\pi_1, \pi_2) = d_{\mathrm{trans}}((i, i_1)\pi_1, \pi_2) + 1$.

If $j_1 \neq i_1$, then the cycle of $\pi_2^{-1}\pi_1$ moving $i_1$ has at least three
elements:

$$(i_1, i, j_1, \ldots, j_s), \text{ with } s \geq 1,$$

and thus it contributes $s + 1$ transpositions to a minimal decomposition of
$\pi_2^{-1}\pi_1$ as a product of transpositions. Now, $i$ is fixed by
$\pi_2^{-1}((i, i_1)\pi_1)$ and the cycle of this permutation moving $i_1$ is

$$(i_1, j_1, j_2, \ldots, j_s),$$

and it only contributes $s$ transpositions to a minimal decomposition of
$\pi_2^{-1}((i, i_1)\pi_1)$ as a product of transpositions. Thus, again in this case,
$d_{\mathrm{trans}}(\pi_1, \pi_2) = d_{\mathrm{trans}}((i, i_1)\pi_1, \pi_2) + 1$. □

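Proposition 4. If $G_{\pi_1} + G_{\pi_2}^{-1}$ has no unbalanced node, then

$$d_{\mathrm{trans}}(\pi_1, \pi_2) = N(G_{\pi_1}, G_{\pi_2}^{-1}) - A(G_{\pi_1}, G_{\pi_2}^{-1}),$$

where $N(G_{\pi_1}, G_{\pi_2}^{-1})$ is the number of non-isolated nodes of
$G_{\pi_1} + G_{\pi_2}^{-1}$ and $A(G_{\pi_1}, G_{\pi_2}^{-1})$ is the number of alternating
cycles in $G_{\pi_1} + G_{\pi_2}^{-1}$, i.e., of cycles in this directed 2-colored-arcs
multigraph such that two consecutive arcs have different colors.

Proof. If $G_{\pi_1} + G_{\pi_2}^{-1}$ has no unbalanced node, then every node either is
isolated or has exactly one incoming and one outgoing arc of each color. This entails that
$Q_{\pi_1} \cup Q_{\pi_2}^{-1}$ decomposes into the union of arc-disjoint alternating cycles.
Now, every length $2k$ alternating cycle

$$(i_1, j_1), (j_1, i_2), (i_2, j_2), (j_2, i_3), \ldots, (i_k, j_k), (j_k, i_1),$$

with $(i_l, j_l) \in Q_{\pi_1}$ for every $l = 1, \ldots, k$, $(j_l, i_{l+1}) \in Q_{\pi_2}^{-1}$
for every $l = 1, \ldots, k - 1$, and $(j_k, i_1) \in Q_{\pi_2}^{-1}$, corresponds to a
length $k$ cycle

$$(i_1, i_2, \ldots, i_k)$$

of $\pi_2^{-1}\pi_1$, and hence it adds $k - 1$ transpositions to any minimal decomposition
of this permutation into transpositions.

Therefore, if we denote by $\mathcal{A}(G_{\pi_1}, G_{\pi_2}^{-1})$ the set of alternating
cycles in $G_{\pi_1} + G_{\pi_2}^{-1}$, we have that

$$d_{\mathrm{trans}}(\pi_1, \pi_2) = \sum_{C \in \mathcal{A}(G_{\pi_1}, G_{\pi_2}^{-1})} \Bigl(\frac{\mathrm{length}(C)}{2} - 1\Bigr)
= \frac{1}{2} \sum_{C \in \mathcal{A}(G_{\pi_1}, G_{\pi_2}^{-1})} \mathrm{length}(C) - |\mathcal{A}(G_{\pi_1}, G_{\pi_2}^{-1})|
= \frac{1}{2}\, |Q_{\pi_1} \cup Q_{\pi_2}^{-1}| - |\mathcal{A}(G_{\pi_1}, G_{\pi_2}^{-1})|,$$

where the red and blue arcs in $Q_{\pi_1} \cup Q_{\pi_2}^{-1}$ are counted separately.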



begin
    Compute the bottom-up orderings of $T_1$ and $T_2$
    Compute the matching representation $M(T_1)$ and the directed graph
        $G_1 = (\{1, \ldots, 2n - 2\}, Q_1)$ associated to $\pi(T_1)$
    Compute the matching representation $M(T_2)$ and the directed graph
        $G_2 = (\{1, \ldots, 2n - 2\}, Q_2)$ associated to $\pi(T_2)^{-1}$
    $d := 0$
    $N :=$ largest number appearing in $M(T_1)$ or $M(T_2)$
    while $G_1 + G_2$ has unbalanced nodes do
        foreach angle $\{(i_0, i), (i, i_1)\}$ in $Q_1$ with $i$ unbalanced and $i_0 \neq i_1$ do
            $Q_1 := (Q_1 - \{(i_0, i), (i, i_1)\}) \cup \{(i_0, i_1)\}$
            $d := d + 1$
            $N := N - 1$
        foreach angle $\{(i_0, i), (i, i_1)\}$ in $Q_2$ with $i$ unbalanced and $i_0 \neq i_1$ do
            $Q_2 := (Q_2 - \{(i_0, i), (i, i_1)\}) \cup \{(i_0, i_1)\}$
            $d := d + 1$
            $N := N - 1$
        foreach $\{(i, i_1), (i_1, i)\}$ in $Q_1$ with $i$ unbalanced do
            $Q_1 := Q_1 - \{(i, i_1), (i_1, i)\}$
            $d := d + 1$
            $N := N - 1$ if $i_1$ is not unbalanced, $N := N - 2$ otherwise
        foreach $\{(i, i_1), (i_1, i)\}$ in $Q_2$ with $i$ unbalanced do
            $Q_2 := Q_2 - \{(i, i_1), (i_1, i)\}$
            $d := d + 1$
            $N := N - 1$ if $i_1$ is not unbalanced, $N := N - 2$ otherwise
    Compute the number $A$ of alternating cycles in the resulting directed multigraph
        $G_1 + G_2$, by traversing them
    $TD(T_1, T_2) := (d + N - A)/2$
end



Algorithm 2: Transposition distance. Given phylogenetic trees
$T_1, T_2 \in \mathcal{T}_n$, the algorithm computes the transposition distance
$TD(T_1, T_2)$.



Finally, it is straightforward to notice that if $G_{\pi_1} + G_{\pi_2}^{-1}$ has no
unbalanced node, then $|Q_{\pi_1}| = |Q_{\pi_2}^{-1}|$ and it is equal to the number of
non-isolated nodes in this multigraph. □

These propositions allow us to compute $TD(T_1, T_2)$, for $T_1, T_2 \in \mathcal{T}_n$, in
time linear in $n$ using the procedure given in pseudocode in Algorithm 2.



Remark 3. If $T_1$ and $T_2$ are two phylogenetic trees with different sets of labels, then
we can compute their transposition distance by first restricting them to the sets of leaves
with common labels, and then relabeling consecutively these common labels, starting with 1.
Since we do not allow outdegree 1 nodes, when we restrict a phylogenetic tree to a subset
of its set of taxa we contract edges to remove outdegree 1 nodes.
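The restriction and relabeling step of Remark 3 can be sketched as follows. The nested-tuple tree encoding, the function names and the taxon names `"A"` to `"E"` are hypothetical and only serve this illustration; common taxa are numbered in sorted order, which is one possible choice of a fixed ordering.

```python
def leaves(tree):
    """Set of taxa of a tree given as nested tuples with leaf names at the tips."""
    if not isinstance(tree, tuple):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Restrict `tree` to the taxa in `keep`, contracting outdegree 1 nodes."""
    if not isinstance(tree, tuple):
        return tree if tree in keep else None
    kids = [k for k in (restrict(child, keep) for child in tree) if k is not None]
    if len(kids) >= 2:
        return tuple(kids)
    return kids[0] if kids else None     # a lone child replaces its parent


def relabel(tree, index):
    """Replace every taxon by its number in `index`."""
    if not isinstance(tree, tuple):
        return index[tree]
    return tuple(relabel(child, index) for child in tree)


def to_common_taxa(t1, t2):
    """Restrict both trees to their common taxa and relabel them 1, 2, ..."""
    common = sorted(leaves(t1) & leaves(t2))
    index = {taxon: k for k, taxon in enumerate(common, start=1)}
    return tuple(relabel(restrict(t, set(common)), index) for t in (t1, t2))


t1 = (("A", "B"), ("C", "D"))
t2 = (("A", "C"), ("E", "D"))
r1, r2 = to_common_taxa(t1, t2)
print(r1)
print(r2)
```

The sketch assumes the two trees share at least one taxon; with fewer than three common taxa the comparison is not informative, which is why the experiments in Section 4 require at least three.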
Fig. 4. A phylogenetic tree and the bottom-up ordering of its restriction to the taxa of the
tree in Fig. 1.



Example 5. Let $T_1$ be the phylogenetic tree in Example 1 and let $T_2$ be the lower
phylogenetic tree displayed in Fig. 4, which represents the bottom-up ordering (with its
taxa sorted alphabetically) of the tree T270c2x3x96c12c57c27 in TreeBASE after removing the
outer taxon Dalbergia (and the elementary root created in this way), which does not appear
in $T_1$. Its matching permutation is

$$\pi(T_2) = (4, 6)(7, 5)(1, 12)(10, 11)(9, 14)(13, 15)(2, 16)(8, 17)(3, 18).$$

Since

$$\pi(T_1) = (1, 5, 7, 9)(4, 6, 10)(2, 11)(8, 13)(3, 12, 14)$$

(see Example 3), the multigraph $G_{\pi(T_1)} + G_{\pi(T_2)}^{-1}$ has nodes
$\{1, \ldots, 18\}$, red arcs (1,5), (5,7), (7,9), (9,1), (4,6), (6,10), (10,4), (2,11),
(11,2), (8,13), (13,8), (3,12), (12,14), and (14,3), and blue arcs (4,6), (6,4), (7,5),
(5,7), (1,12), (12,1), (10,11), (11,10), (9,14), (14,9), (13,15), (15,13), (2,16), (16,2),
(8,17), (17,8), (3,18), and (18,3).
To compute $TD(T_1, T_2)$, we start with $d = 0$ and $N = 18$.

1. At the beginning, 15, 16, 17 and 18 are unbalanced. Then, we remove the pairs of blue
arcs {(13,15), (15,13)}, {(2,16), (16,2)}, {(8,17), (17,8)}, and {(3,18), (18,3)} and we
set $d = 4$ and $N = 14$.

2. In this way, the nodes 2, 3, 8, 13 become unbalanced. Then, we remove the pairs of red
arcs {(2,11), (11,2)}, {(8,13), (13,8)} and we replace the pair of red arcs (14,3), (3,12)
by a new red arc (14,12) and we set $d = 7$ and $N = 10$.

3. Now, 11 has become unbalanced. Then, we remove the pair of blue arcs {(10,11), (11,10)}
and we set $d = 8$ and $N = 9$.

4. Now, 10 has become unbalanced. Then, we replace the pair of red arcs (6,10), (10,4)
by a new red arc (6,4) and we set $d = 9$ and $N = 8$.

5. At this moment, no unbalanced node remains: the resulting multigraph has 5 alternating
cycles (a cycle (1,5,7,9,14,12,1), a cycle (1,12,14,9,1), a cycle (5,7,5), and two cycles
(4,6,4)). Then, we have

$$TD(T_1, T_2) = \tfrac{1}{2}(d + N - 5) = \tfrac{1}{2}(9 + 8 - 5) = 6.$$
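The computation in Example 5 can be replayed with the following Python sketch. It follows the idea behind Propositions 3 and 4 rather than the exact control flow of Algorithm 2: unbalanced nodes are smoothed out one at a time from a work stack (both the "angle" case and the 2-cycle case amount to replacing the affected permutation by $(i, i_1)$ times itself), and the alternating cycles are then counted as the cycles of the composition of the blue map with the red map on the non-isolated nodes. All names and the dictionary encoding are assumptions of this example.

```python
def perm_from_cycles(cycles, m):
    """Permutation of {1..m} as a dict i -> pi(i), fixed points included."""
    perm = {i: i for i in range(1, m + 1)}
    for c in cycles:
        for a, b in zip(c, c[1:] + c[:1]):
            perm[a] = b
    return perm


def transposition_distance(pi1, pi2):
    red, red_inv = dict(pi1), {j: i for i, j in pi1.items()}       # arcs of pi1
    blue, blue_inv = {j: i for i, j in pi2.items()}, dict(pi2)     # arcs of pi2^{-1}

    def unbalanced(i):
        return (red[i] == i) != (blue[i] == i)

    d = 0
    stack = [i for i in red if unbalanced(i)]
    while stack:
        i = stack.pop()
        if not unbalanced(i):
            continue
        arcs, inv = (red, red_inv) if red[i] != i else (blue, blue_inv)
        i0, i1 = inv[i], arcs[i]
        arcs[i0], inv[i1] = i1, i0     # new arc (i0, i1); a fixed point if i0 == i1
        arcs[i], inv[i] = i, i         # i becomes isolated
        d += 1
        stack += [i0, i1]              # only these nodes may have become unbalanced

    nodes = [i for i in red if red[i] != i]       # the N non-isolated nodes
    seen, a = set(), 0
    for s in nodes:                                # alternating cycles
        if s not in seen:
            a += 1
            while s not in seen:
                seen.add(s)
                s = blue[red[s]]                   # one red arc, then one blue arc
    return (d + len(nodes) - a) // 2


pi1 = perm_from_cycles([(1, 5, 7, 9), (4, 6, 10), (2, 11), (8, 13), (3, 12, 14)], 18)
pi2 = perm_from_cycles([(4, 6), (7, 5), (1, 12), (10, 11), (9, 14),
                        (13, 15), (2, 16), (8, 17), (3, 18)], 18)
print(transposition_distance(pi1, pi2))
```

Each smoothing step isolates one more node and the stack receives at most two nodes per step, so the whole computation stays linear in $n$. The order in which unbalanced nodes are processed may differ from the one used in Example 5, but by Propositions 3 and 4 the value of $d + N - A$ does not depend on it.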

In the Introduction we mentioned that the transposition distance defined in this paper
generalizes the transposition distance for fully resolved trees. This will be a direct
consequence of the following result.

Proposition 5. For every pair of binary phylogenetic trees $T_1, T_2 \in \mathcal{T}_n$, let
$G = (V, E)$ be the undirected multigraph with $V = \{1, \ldots, 2n - 2\}$ and
$E = M(T_1) \cup M(T_2)$, and let $C$ be the set of connected components of $G$. Then,
$TD(T_1, T_2) = n - 1 - |C|$.

Proof. Let $G_1$ and $G_2$ denote the directed graphs associated to $\pi(T_1)$ and
$\pi(T_2)^{-1}$. Since $T_1$ and $T_2$ are binary, in $G_1 + G_2$ for every blue or red arc
$(i, j)$ there is the inverse arc $(j, i)$ of the same color, and the graph $G$ in the
statement is the undirected graph obtained by replacing each pair of arcs of the same color
$\{(i, j), (j, i)\}$ by the undirected edge $\{i, j\}$, which we shall understand colored
with the same color as the original pair.

Since $T_1, T_2 \in \mathcal{T}_n$ are binary, and therefore they have $2n - 1$ nodes, none
of the $2n - 2$ nodes of $G_1 + G_2$ is unbalanced or isolated. Then, by Proposition 4,

$$TD(T_1, T_2) = \tfrac{1}{2}\bigl((2n - 2) - A(G_1, G_2)\bigr) = n - 1 - \tfrac{1}{2} A(G_1, G_2).$$

Moreover, $G$ is 2-regular, and therefore every connected component in $G$ is an alternating
cycle, which contains exactly two alternating cycles of $G_1 + G_2$. Therefore
$A(G_1, G_2) = 2|C|$. Combining this equality with the expression for $TD(T_1, T_2)$ given
by Proposition 4, we obtain the expression in the statement. □
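For binary trees, Proposition 5 reduces the computation to counting connected components, which a union-find structure does in near-linear time. The sketch below is an illustration under the assumption that each matching representation is given as a list of pairs; the function name `td_binary` is hypothetical. The data are the binary trees $T_1$, $T_2$ and $T_4$ of Example 4.

```python
def td_binary(m1, m2, n):
    """TD(T1, T2) = n - 1 - |C| for binary trees with n leaves (Proposition 5)."""
    parent = list(range(2 * n - 1))          # nodes 1..2n-2; index 0 is unused

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    for a, b in list(m1) + list(m2):         # edges of M(T1) and M(T2)
        parent[find(a)] = find(b)

    components = len({find(x) for x in range(1, 2 * n - 1)})
    return n - 1 - components


m_t1 = [(1, 2), (3, 4), (5, 6)]
m_t2 = [(1, 3), (2, 4), (5, 6)]
m_t4 = [(1, 2), (3, 5), (4, 6)]
print(td_binary(m_t1, m_t2, 4), td_binary(m_t1, m_t4, 4), td_binary(m_t2, m_t4, 4))
```

These values agree with the ones obtained from the cycle structure of the products in Example 4.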

In [17], the transposition distance between two binary phylogenetic trees $T_1$ and $T_2$
was defined as the least number of transpositions necessary to transform $M(T_1)$ into
$M(T_2)$: in this context, a transposition means a replacement of a pair of 2-element sets
$\{i, j\}, \{k, l\}$ by a new pair $\{i, k\}, \{j, l\}$. Theorem 1 in loc. cit. and the last
proposition entail that, for binary phylogenetic trees, our transposition distance and the
transposition distance defined in [17] are the same.



4 Results 



We have implemented in Perl the algorithms for the transposition distance between
phylogenetic trees, using the BioPerl collection of Perl modules for computational
biology [15]. The software is available in source code form for research use to educational
institutions, non-profit research institutes, government research laboratories, and
individuals, for non-exclusive use, without the right of the licensee to further
redistribute the source code. The software is also provided for free public use on a web
server, at the address http://www.lsi.upc.edu/~valiente/

Using this implementation, we have performed a systematic study of the TreeBASE [7]
phylogenetic database, the main repository of published phylogenetic analyses, which
currently contains 2,592 phylogenies with 36,593 taxa among them. Previous studies have
revealed that TreeBASE constitutes a scale-free network [?].
Fig. 5. Similarity of phylogenetic trees in TreeBASE based on the transposition distance.
Each bullet represents the distance between a phylogenetic tree and the most similar
phylogenetic tree in TreeBASE (other than itself) with at least three common taxa.



In order to assess the usefulness of the new distance measure in practice, we have
computed the transposition distance for each of the $2{,}592 \cdot 2{,}591 / 2 = 3{,}357{,}936$
pairs of phylogenetic trees in TreeBASE. Then, for each phylogenetic tree, we have recovered
the most similar phylogenetic tree in TreeBASE (other than itself) with at least three taxa
in common. The results, summarized in Fig. 5, show that the transposition distance allows
for a good recall of similar phylogenetic trees.